ABSTRACT

Two papers are included as Part Four of this report on Salton's
Magical Automatic Retriever of Texts (SMART) project. The first paper,
"A Controlled Single Pass Classification Algorithm with Application to
Multilevel Clustering" by D. B. Johnson and J. M. Lafuente, presents a
single pass clustering method which compares favorably with more
expensive clustering algorithms. The method is tested using the ADI
collection of 82 documents and the Cranfield 424 collection. The
results are compared to full search and to results obtained by
searching clusters produced by Dattola's algorithm. The second paper,
"A Systematic Study of Query-Clustering Techniques: A Progress Report"
by S. Worona, describes an experiment using various techniques of query
clustering on the Cranfield 424 document collection and gives some
preliminary results. Several methods of evaluating the performance of
clustered searches in the context of query clustering are discussed.
Some observations are also made concerning use of the SMART system as
implemented at Cornell University. (For the entire SMART project report
see LI 002 719; for parts 1-3 see LI 002 720 through LI 002 722; for
part 5 see LI 002 724.) (NH)

ERIC User Please Note:

This summary discusses all 5 parts of Information Storage and
Retrieval (ISR-18), which is available in its entirety as
LI 002 719. Only the papers from Part Four are reproduced here
as LI 002 723. See LI 002 720 thru LI 002 722 for Parts 1 - 3
and LI 002 724 for Part 5.

Summary 

The present report is the eighteenth in a series describing research
in automatic information storage and retrieval conducted by the Department
of Computer Science at Cornell University. The report, covering work carried
out by the SMART project for approximately one year (summer 1969 to summer
1970), is separated into five parts: automatic content analysis (Sections
I to IV), automatic dictionary construction (Sections V to VII), user
feedback procedures (Sections VIII to XI), document and query clustering
methods (Sections XII and XIII), and SMART systems design for on-line
operations (Sections XIV and XV).

Most recipients of SMART project reports will experience a gap in
the series of scientific reports received to date. Report ISR-17, consisting
of a master's thesis by Thomas Brauen entitled "Document Vector Modification
in On-line Information Retrieval Systems", was prepared for limited
distribution during the fall of 1969. Report ISR-17 is available from the
National Technical Information Service in Springfield, Virginia 22151, under
order number PB 186-135.

The SMART system continues to operate in a batch processing mode
on the IBM 360 model 65 system at Cornell University. The standard processing
mode is eventually to be replaced by an on-line system using time-shared
console devices for input and output. The overall design for such an on-line
version of SMART has been completed, and is described in Section XIV of the
present report. While awaiting the time-sharing implementation of the
system, new retrieval experiments have been performed using larger document
collections within the existing system. Attempts to compare the performance
of several collections of different sizes must take into account the
collection "generality". A study of this problem is made in Section II of
the present report. Of special interest may also be the new procedures
for the automatic recognition of "common" words in English texts (Section
VI), and the automatic construction of thesauruses and dictionaries for use
in an automatic language analysis system (Section VII). Finally, a new
inexpensive method of document classification and term grouping is
described and evaluated in Section XII of the present report.

Sections I to IV cover experiments in automatic content analysis
and automatic indexing. Section I by S. F. Weiss contains the results of
experiments using statistical and syntactic procedures for the automatic
recognition of phrases in written texts. It is shown once again that
because of the relative heterogeneity of most document collections, and
the sparseness of the document space, phrases are not normally needed
for content identification.

In Section II by G. Salton, the "generality" problem is examined
which arises when two or more distinct collections are compared in a
retrieval environment. It is shown that proportionately fewer nonrelevant
items tend to be retrieved when larger collections (of low generality)
are used, than when small, high generality collections serve for evaluation
purposes. The systems viewpoint thus normally favors the larger, low
generality output, whereas the user viewpoint prefers the performance of
the smaller collection.

The effectiveness of bibliographic citations for content analysis
purposes is examined in Section III by G. Salton. It is shown that in
some situations when the citation space is reasonably dense, the use of
citations attached to documents is even more effective than the use of
standard keywords or descriptors. In any case, citations should be added
to the normal descriptors whenever they happen to be available.

In the last section of Part 1, certain template analysis methods
are applied to the automatic resolution of ambiguous constructions
(Section IV by S. F. Weiss). It is shown that a set of contextual rules
can be constructed by a semi-automatic learning process, which will eventually
lead to an automatic recognition of over ninety percent of the existing
textual ambiguities.

Part 2, consisting of Sections V, VI and VII, covers procedures
for the automatic construction of dictionaries and thesauruses useful in
text analysis systems. In Section V by D. Bergmark it is shown that word
stem methods using large common word lists are more effective in an
information retrieval environment than some manually constructed thesauruses,
even though the latter also include synonym recognition facilities.

A new model for the automatic determination of "common" words
(which are not to be used for content identification) is proposed and
evaluated in Section VI by K. Bonwit and J. Aste-Tonsmann. The resulting
process can be incorporated into fully automatic dictionary construction
systems. The complete thesaurus construction problem is reviewed in Section
VII by G. Salton, and the effectiveness of a variety of automatic dictionaries
is evaluated.

Part 3, consisting of Sections VIII through XI, deals with a
number of refinements of the normal relevance feedback process which has
been examined in a number of previous reports in this series. In Section
VIII by T. P. Baker, a query splitting process is evaluated in which input
queries are split into two or more parts during feedback whenever the
relevant documents identified by the user are separated by one or more
nonrelevant ones.

The effectiveness of relevance feedback techniques in an
environment of variable generality is examined in Section IX by B. Capps and
M. Yin. It is shown that some of the feedback techniques are equally
applicable to collections of small and large generality. Techniques of negative
feedback (when no relevant items are identified by the users, but only
nonrelevant ones) are considered in Section X by M. Kerchner. It is shown
that a number of selective negative techniques, in which only certain
concepts are actually modified during the feedback process, bring
good improvements in retrieval effectiveness over the standard nonselective
methods.

Finally, a new feedback methodology in which a number of documents
jointly identified as relevant to earlier queries are used as a set for
relevance feedback purposes is proposed and evaluated in Section XI by L.
Paavola.

Two new clustering techniques are examined in Part 4 of this report,
consisting of Sections XII and XIII. A controlled, inexpensive, single-pass
clustering algorithm is described and evaluated in Section XII by D. B.
Johnson and J. M. Lafuente. In this clustering method, each document is
examined only once, and the procedure is shown to be equivalent in certain
circumstances to other more demanding clustering procedures.

The query clustering process, in which query groups are used to
define the information search strategy, is studied in Section XIII by S.
Worona. A variety of parameter values is evaluated in a retrieval
environment to be used for cluster generation, centroid definition, and final
search strategy.

The last part, number five, consisting of Sections XIV and XV,
covers the design of on-line information retrieval systems. A new
SMART system design for on-line use is proposed in Section XIV by D. and
R. Williamson, based on the concepts of pseudo-batching and the interaction
of a cycling program with a console monitor. The user interface and
conversational facilities are also described.

A template analysis technique is used in Section XV by S. F. Weiss
for the implementation of conversational retrieval systems used in a
time-sharing environment. The effectiveness of the method is discussed, as
well as its implementation in a retrieval situation.

Additional automatic content analysis and search procedures used
with the SMART system are described in several previous reports in this
series, including notably reports ISR-11 to ISR-16 published between 1966
and 1969. These reports are all available from the National Technical
Information Service in Springfield, Virginia.

G. Salton

XII. A Controlled Single Pass Classification Algorithm
with Application to Multilevel Clustering

D. B. Johnson and J. M. Lafuente



Abstract 

A single pass clustering method is presented, which compares
favorably with more expensive clustering algorithms. During the clustering
process various parameters are controlled, such as number of clusters,
size of the clusters and amount of overlap. The method is tested using
the ADI collection of 82 documents and the Cranfield 424 collection. The
results are compared to full search and to results obtained by searching
clusters produced by Dattola's algorithm. The effect of ordering of the
collection is investigated and some variation is obtained in the results
for different orderings. Single-level as well as two-level clustering is
considered. The results, in general, point to better performance with
multilevel clustering, and some suggestions for extending the algorithm to
include multilevel clustering are given.

1. Introduction 

An important consideration in an automatic information retrieval
system is the time spent in searching a collection. To avoid searching
the entire collection, it becomes necessary to classify documents into
related groups. This is the technique of clustering. Documents are grouped
into clusters by assigning items containing similar concepts to the same
cluster. A centroid vector is constructed for each cluster, and queries
are matched at first against these centroids. Only those clusters comparing
favorably with a query are then searched in the normal manner. Thus a
sacrifice in time to produce the clusters is compensated by later savings
in the search time. The problem is how to develop efficient techniques for
producing meaningful clusters in order to minimize searching costs. The
suitability of any clustering method must then be measured according to
the following criteria:

1. Cost of generating the clusters;

2. Cost of searching the clustered collection;

3. Effectiveness of the search process, usually measured
by evaluating recall and precision.

The clustering problem becomes critical with very large collections.
Comparing each document with every other document is no longer feasible,
and efficient algorithms have been developed which attempt to minimize the
number of document comparisons. Even with these methods, the classification
of a collection containing several thousand items is a time consuming
process.

This paper describes a clustering algorithm which makes a single
pass over a collection. Each document is examined only once and clusters
are formed in the process. A document is considered for inclusion into one
or more existing clusters before it is allowed to begin a cluster of its
own. Various parameters such as cluster size, number of clusters, and amount
of overlap are controlled throughout the clustering process. The algorithm
is tested using the ADI collection of 82 documents and 35 queries available
in the SMART system. The algorithm, however, is designed with a view toward
large collections.

2. Methods of Clustering

Various methods have been devised for clustering. Usually these
methods require the computation of a correlation matrix, representing the
correlation of each document with every other document in the collection,
followed by a grouping of those documents which correlate best with each
other. Input parameters for these clustering algorithms include the number
of clusters desired, the maximum size of each cluster, the amount of
overlap and the number of "loose" or unclassified documents to be allowed.

Two clustering methods are presently in use in the SMART system [1];
they are Rocchio's clustering algorithm [2] and a variation of
Doyle's algorithm [3]. In Rocchio's algorithm, each unclustered document is
selected as a candidate for a cluster nucleus. All remaining documents are
then correlated with it, and the document is subjected to a density test
based on cut-off correlation coefficients. If the document passes the
density test, a new cluster is formed and a cut-off correlation is determined
based on the relative distribution of correlations with the given document.
A centroid vector is then computed by combining all concepts of those
documents with correlation above the computed cut-off value. The centroid vector
is next matched against the entire collection to create an altered cluster.
The entire procedure is now repeated with all unclustered documents until
all documents are either clustered or loose.

Doyle's algorithm basically consists in matching documents to existing
clusters by computing a document-cluster score for each document relative to
each cluster and admitting a document to those clusters for which a
sufficiently high score is recorded. New centroids are then computed for each
altered cluster. The process is then repeated; at each step of the
iteration all the documents are correlated with all the clusters, and the clusters
are updated until further updating does not alter any of the existing clusters.

It can be shown that Rocchio's algorithm requires order $N^2$ vector
comparisons, where N is the number of documents, while Doyle's algorithm
is of the order $N \cdot m$ where m is the number of clusters. A more efficient
method due to Dattola [4] requires time proportional to $N p \log_p m$ where N
is the number of items in the collection, m is the number of clusters
desired and p is the number of clusters produced at each level of the
algorithm. The method is an outgrowth of Doyle's attempt to obtain a fast
algorithm for clustering large document collections. In each cycle of Dattola's
algorithm, each document in the collection is scored against each existing
cluster by a certain scoring function. New clusters are then computed while
some documents remain loose. The cycle is then repeated with the new
clusters. The algorithm is designed to control the number of clusters, size
of clusters and amount of overlap. The number of clusters and amount of
overlap are specified as input parameters, while the size of the clusters
is controlled internally. One problem with Dattola's algorithm is that some
way must be found to designate initial clusters.



An inexpensive one-pass clustering method has been proposed by Rieber
and Marathe [5]. In this method the first document automatically becomes the
centroid of the first cluster. Subsequent documents are correlated with
existing clusters and, depending on how the correlation with each cluster
compares with the minimum correlation cut-off, the document is either
admitted to one or more existing clusters or allowed to start a new cluster.
If a document is admitted to an existing cluster, the cluster centroid is
recomputed. The method allows for disjoint clusters, where a document can
only be included in one cluster, or overlapped clusters, where a document
is included in every cluster with which it has a high correlation. The
single pass method of Rieber and Marathe compares favorably with more
complex clustering methods in search cost, but there is no control on
the cluster size or the number of clusters generated. This results in
initial clusters being exceptionally large. Moreover, the process is
likely to be order-dependent since the formation of the clusters depends
on how the documents are encountered.
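
The one-pass procedure just described can be illustrated with a minimal Python sketch. It is not the program of Rieber and Marathe: the dictionary representation of document vectors, the names `cosine` and `one_pass_cluster`, the use of summed weights for the centroid, and the single fixed `cutoff` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine correlation of two sparse vectors stored as {concept: weight}."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def one_pass_cluster(docs, cutoff, overlap=False):
    """Cluster docs in one pass using a fixed correlation cut-off.

    Each cluster is a dict with 'members' (document indices) and
    'centroid' (summed concept weights, an assumed centroid definition).
    """
    clusters = []
    for idx, doc in enumerate(docs):
        # Correlate the document with every cluster that exists so far.
        scores = [(cosine(doc, c["centroid"]), c) for c in clusters]
        hits = [(s, c) for s, c in scores if s >= cutoff]
        if not hits:
            # No cluster is close enough: the document starts a new one.
            clusters.append({"members": [idx], "centroid": dict(doc)})
            continue
        if not overlap:
            # Disjoint clusters: keep only the best-matching cluster.
            hits = [max(hits, key=lambda sc: sc[0])]
        for _, c in hits:
            c["members"].append(idx)
            # Recompute the centroid by adding the new document's weights.
            for concept, weight in doc.items():
                c["centroid"][concept] = c["centroid"].get(concept, 0.0) + weight
    return clusters
```

Nothing in this sketch limits cluster size or the number of clusters, so the outcome depends entirely on `cutoff` and on the order of `docs`, which is the weakness noted in the text.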

3. Strategy

To produce retrieval results for the user, two major costs are
incurred: the cost of preparing the collection, and the cost of searching.
Search cost can be reduced if an investment is made in clustering the
collection. The aim of this study is to give a clustering algorithm which
operates substantially more cheaply than those presently in use but for
which the search costs are similar. In this way it is possible to compare
clustering cost directly with search effectiveness. Alternatively, search
parameters can be adjusted until search effectiveness is comparable,
yielding a direct comparison of clustering and search costs. In either
case, it may be possible to exhibit the extent to which the clusters are
less optimal than those of other algorithms.

One approach to keeping search costs low is to use an algorithm
which, within the constraint of a single pass over the document collection,
will produce on the basis of document vector similarities a set of clusters
of a given size distribution measured in terms of mean size, maximum size
and overlap. This aim is achieved by the algorithm described in this study.
The extent to which sets of clusters with similar size distributions have
similar search costs is discussed in Section 6.C.

Specifically, the experiments are designed to allow a comparison
with Dattola's algorithm. In his work [4], a mean cluster size is chosen.
Clusters more than twice or less than half the mean size are not allowed.
In the one-pass algorithm described in this study, the mean and maximum are
easily controlled. The minimum is not controlled directly, since doing so
requires blending small clusters in a second pass. It is questionable
whether fixed limits on the cluster size are desirable. Certainly some sort
of upper limit is needed to keep search costs well below that for a full
search. The effect of a lower limit, given a mean and maximum, is less clear.
In any event the method does not control the minimum whereas Dattola's does.

The composition of clusters from a single-pass method depends on
the order in which the documents are processed. A document cannot be placed
in a cluster with a sequence number higher than that of the document. For
example, if there are N documents in the collection and n clusters are
produced, the first n-1 documents cannot belong to cluster n. In general,
it will also be more difficult for a document N to join cluster 1 than
for a document with an earlier sequence number.

The effect of order can be controlled partially and indirectly by
controlling the rate at which clusters are allowed to form in the early part of
the pass. Otherwise, order dependency is inherent in the single-pass method
just as it is in other methods in which the number of iterations is limited.
The degree of order dependency of the algorithm of this study is discussed
in Section 6.B.

4. The Algorithm 

The clustering algorithm accepts input vectors, each describing a
document, and assigns each document either to one or more existing clusters
or to a new cluster, depending on the correlation of the input vector with
the cluster centroids. Each document vector is of the standard form,
consisting of pairs of concept numbers and corresponding weights. The weight
of each term in a centroid is obtained by summing the weights of that term
for all documents in the centroid.

Vector correlations are computed as the multidimensional cosine
between them, COS, as follows:

$$\mathrm{COS} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2 \, \sum_{i=1}^{n} v_i^2}}$$

where

COS = the cosine correlation between vectors u and v
n = number of terms in the collection dictionary.

COS ranges from 0 to 1. In order to control the cluster size, COS is
modified, as discussed below. It is the modified correlation, COR, which is
actually used in making clustering decisions.

One stage of the clustering process is defined as the comparison
of one input document with all existing clusters. If the correlation, COR,
of the document with the centroid of a cluster is greater than the cut-off
value, CORCUT, the document is added to the cluster with which it has the
best correlation. The document is also added to any clusters for which COR
lies no more than an amount GAP below the highest correlation (maximum over
all clusters) COR, thus producing overlap. If GAP = 0, no overlap occurs.
If a document is not admitted to any cluster, it defines a new one. A
centroid is recomputed whenever a document is added.

Number of clusters, cluster size and overlap are controlled
dynamically. When, during the course of clustering, the number of clusters compared
to the number of documents already processed becomes large, it is necessary
to make it easier in general for documents to join clusters. However, to
control cluster size it must be generally more difficult for a document to
join a large cluster than a small one. To achieve these ends, CORCUT is
varied to control the number of clusters while individual correlations, COS,
are reduced by an amount related to cluster size in order to control cluster
size near some maximum value.



A) Cluster Size

It is desired to define a function COR which will depend on the cosine
correlation, COS, and also on the number of documents which the cluster would
contain if the new document were admitted. COR should increase with COS and
decrease as cluster size increases. If COS is very high, however, it would
be unreasonable to exclude the document even from a large cluster. Therefore,
when COS is 1, COR should equal 1 as well.

The following function, chosen for the algorithm, meets the above
requirements:

$$\mathrm{COR} = \mathrm{COS}^{\,y}$$

where

$$y = \left( \frac{\mathrm{NCEIL}}{\mathrm{NCEIL} - \min(\mathrm{NCEIL} - 1/B,\; M_i + M)} + A \right) / A$$

NCEIL = cluster size ceiling requested by user
M_i = number of documents in input vector
      (M_i = 1 unless clusters are being clustered)
M = number of documents in cluster
A, B = tuning parameters which the algorithm supplies
       and with which the user in general need not be concerned.

Parameter A controls the rate at which the ratio COS/COR grows with cluster
size when a cluster is small. Parameter B controls this ratio near and
beyond the cluster size limit. If B is small, clusters can actually grow
over the limit given by the user.
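
The size penalty can be sketched in Python as follows. The function names are illustrative, and the grouping of parentheses in `size_exponent` is an assumed reading of the formula for y, whose layout is ambiguous in the source; `modified_correlation` only relies on COR = COS^y, which agrees with the COS, y and COR values tabulated for clusters 1, 3 and 6 in the example of Section 4.D.

```python
def size_exponent(nceil, m_cluster, m_input=1, a=40.0, b=2.0):
    """Exponent y applied to COS (assumed parenthesization of the formula).

    nceil     -- cluster size ceiling requested by the user (NCEIL)
    m_cluster -- number of documents already in the cluster (M)
    m_input   -- number of documents in the input vector (M_i)
    a, b      -- tuning parameters A and B (defaults from the text)
    """
    # The prospective cluster size is capped just below the ceiling, so
    # the denominator never falls below 1/B and stays positive.
    size_after = min(nceil - 1.0 / b, m_input + m_cluster)
    return (nceil / (nceil - size_after) + a) / a


def modified_correlation(cos, y):
    """COR = COS ** y; since 0 <= COS <= 1 and y >= 1, COR <= COS."""
    return cos ** y


# The exponent grows as the cluster fills up, lowering COR for large clusters.
for size in (1, 5, 10, 14):
    y = size_exponent(15, size)
    print(size, round(y, 3), round(modified_correlation(0.5, y), 3))
```

Under this reading a perfect match (COS = 1) keeps COR = 1 whatever the cluster size, as the text requires.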

B) Number of Clusters

Following each stage, CORCUT is recomputed in order to control the
number of clusters finally produced. The ratio of clusters produced to
documents processed up to the moment is computed. If this ratio is larger
than the value desired in the final result, CORCUT is adjusted downward
from a base value FCOR, reducing the probability that a new cluster will be
generated in the next stage. If the ratio is smaller than the desired value,
CORCUT is raised. The base value FCOR moves toward CORCUT at a rate fixed
by the algorithm. The user may supply a value for this rate.

New values of CORCUT and FCOR are computed as follows:

$$\mathrm{CORCUT}_i = (\mathrm{FCOR}_{i-1})^{x}$$

$$\mathrm{FCOR}_i = \mathrm{FCOR}_{i-1} \cdot R + \mathrm{CORCUT}_i \cdot (1 - R)$$

where

$$x = (\mathrm{NCL} - \mathrm{NCLREQ} \cdot \mathrm{NINPUT} / \mathrm{NE} + D) / D$$

NCL = number of clusters following stage i-1
NCLREQ = number of clusters requested by user
NE = number of documents in source collection
NINPUT = number of documents input through stage i-1
R = parameter controlling rate at which FCOR follows CORCUT
D = tuning parameter set by algorithm.

If R=0, CORCUT varies widely and poorer clusters are produced.
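
A minimal Python sketch of this update is given here. The function name `update_corcut` and the argument names are illustrative choices that mirror the symbols of the text; the defaults R = .9 and D = 5 are the values the paper lists as algorithm defaults.

```python
def update_corcut(fcor_prev, ncl, ninput, nclreq, ne, r=0.9, d=5.0):
    """Return (CORCUT_i, FCOR_i) from FCOR_{i-1} and the current counts.

    ncl    -- clusters existing after the previous stage (NCL)
    ninput -- documents processed so far (NINPUT)
    nclreq -- clusters requested by the user (NCLREQ)
    ne     -- documents in the whole collection (NE)
    """
    # nclreq * ninput / ne is the cluster count that would be "on
    # schedule"; a surplus of clusters pushes x above 1.
    x = (ncl - nclreq * ninput / ne + d) / d
    # Because 0 < FCOR < 1, an exponent above 1 lowers the cut-off and
    # makes joining existing clusters easier.
    corcut = fcor_prev ** x
    # FCOR follows CORCUT slowly when R is close to 1.
    fcor = fcor_prev * r + corcut * (1.0 - r)
    return corcut, fcor


# Too many clusters early in the pass: the cut-off drops below the base.
print(update_corcut(0.4, ncl=7, ninput=20, nclreq=9, ne=82))
# Too few clusters: the cut-off rises above the base.
print(update_corcut(0.4, ncl=1, ninput=40, nclreq=9, ne=82))
```

With R = 0 the second line of the update makes FCOR equal to CORCUT at every stage, so there is no smoothing, which is consistent with the remark that CORCUT then varies widely.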



C) Overlap

A document is added to a cluster only when two conditions are met:

1. COR > CORCUT

2. max (over all clusters) COR - COR < GAP

It can be seen that GAP controls overlap. Between stages GAP is
adjusted as follows:

$$\mathrm{GAP}_i = (\mathrm{FGAP}_{i-1})^{z}$$

where

$$z = (M_1 + \cdots + M_{\mathrm{NCL}} + E) / (\mathrm{NINPUT} \cdot (\mathrm{OVLAP} + 1) + E)$$

OVLAP = the fractional overlap requested by the user
FGAP = base value of GAP

$$\mathrm{FGAP}_i = \mathrm{FGAP}_{i-1} \cdot R + \mathrm{GAP}_i \cdot (1 - R)$$

D) An Example

The following example illustrates how the clustering parameters are
adjusted dynamically during the clustering process and how the individual
correlations are computed. The values presented are taken from a run on the
ADI collection. User selected parameters are given as follows:



NCEIL = 15
NCLREQ = 9
OVLAP = .122

Default options are used for the other parameters, namely:

FCOR = .4      A = 40
FGAP = .001    B = 2
R = .9         D = 5
               E = 1

Fig. 1 shows how CORCUT varies during clustering. Fig. 2 gives
a similar presentation for GAP. Of course, CORCUT and GAP change in
discrete steps after each document has been clustered. Points in the figures
have been connected for ease of presentation.
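
The stage-by-stage adjustment of GAP described in Section 4.C can be sketched in Python as follows. The function name `update_gap` and its argument names are illustrative; the defaults R = .9 and E = 1 are the default options quoted for this example.

```python
def update_gap(fgap_prev, cluster_sizes, ninput, ovlap, r=0.9, e=1.0):
    """Return (GAP_i, FGAP_i) from FGAP_{i-1}.

    cluster_sizes -- current sizes M_1 ... M_NCL of all clusters
    ninput        -- documents processed so far (NINPUT)
    ovlap         -- fractional overlap requested by the user (OVLAP)
    """
    # The sum of cluster sizes counts a document once per cluster it
    # joined; ninput * (1 + ovlap) is the total the user asked for.
    z = (sum(cluster_sizes) + e) / (ninput * (ovlap + 1.0) + e)
    # FGAP lies between 0 and 1, so z > 1 (too much overlap) shrinks GAP
    # and z < 1 (too little overlap) enlarges it.
    gap = fgap_prev ** z
    fgap = fgap_prev * r + gap * (1.0 - r)
    return gap, fgap


# Ten documents, no multiple assignments yet: GAP rises above its base.
print(update_gap(0.001, [4, 3, 3], ninput=10, ovlap=0.122))
# Ten documents already occupying fifteen cluster slots: GAP falls.
print(update_gap(0.001, [6, 5, 4], ninput=10, ovlap=0.122))
```

This is in line with the caption of Fig. 2, which notes that GAP drops each time a document is assigned to multiple clusters.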

Consider, for example, the input of the 65th document. Seven
clusters exist at this point, so COS and COR are computed with respect
to each cluster, giving the following results:

Cluster    Number of Documents
Number     in Cluster             COS      y       COR

   1              14              .458     2.5     .142
   2              10              .080     1.3     .037
   3              14              .498     2.5     .175
   4              12              .214     1.5     .010
   5              12              .205     1.5     .009
   6               1              .232     1.17    .181
   7               1              .050     1.17    .030

[Fig. 1. Variation of CORCUT with Input. Horizontal axis: Number of
Documents Input. Points at which new clusters form are marked with
cluster numbers. First three points are off scale.]

[Fig. 2. Variation of GAP with Input. Vertical axis: GAP; horizontal
axis: Number of Documents Input. Each downward discontinuity in the
plot occurs when a document is assigned to multiple clusters.]

Taking the values of CORCUT = .045 and GAP = .0089 as adjusted following
the 64th document, document 65 is admitted to clusters 3 and 6 as follows:

Cluster Number     COR      Admit

      6            .181       *
      3            .175       *
                                     <- .181 - GAP = .172
      1            .142
                                     <- CORCUT = .046
      2            .037
      7            .030
      4            .010
      5            .009
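
The admission decision for document 65 can be replayed with a short Python sketch. The function name `admit` is illustrative. One detail is an assumption: the best-correlated cluster is always admitted when its COR exceeds CORCUT, following the prose description of a stage, so that GAP = 0 yields exactly one cluster rather than none.

```python
def admit(cor_by_cluster, corcut, gap):
    """Return the sorted cluster numbers a document is admitted to.

    An empty list means no cluster qualifies, in which case the
    document would start a new cluster.
    """
    best = max(cor_by_cluster.values())
    admitted = []
    for cluster, cor in cor_by_cluster.items():
        above_cutoff = cor > corcut                    # condition 1
        near_best = cor == best or best - cor < gap    # condition 2
        if above_cutoff and near_best:
            admitted.append(cluster)
    return sorted(admitted)


# COR values for document 65 against the seven existing clusters.
cor_65 = {1: 0.142, 2: 0.037, 3: 0.175, 4: 0.010,
          5: 0.009, 6: 0.181, 7: 0.030}
print(admit(cor_65, corcut=0.045, gap=0.0089))  # clusters 3 and 6
```

Cluster 1 clears CORCUT but lies .039 below the best correlation, well outside GAP, so it is not admitted.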



5. Implementation

The algorithm is implemented in Fortran. The user specifies
starting values as follows:

NCEIL = approximate maximum number of documents desired
        in any cluster
NCLREQ = number of clusters desired
OVLAP = fractional overlap desired

The algorithm sets the following parameters:

FCOR = .4      A = 40
FGAP = .001    B = 2
R = .9         D = 5
               E = 1

The user may override these values if he desires. It is expected
that extensive use of the algorithm would lead to better values than those
already found, although the algorithm is not highly sensitive to them.





A) Storage Management

The algorithm is designed with a view toward application to large
collections. Core storage and accesses to secondary storage are minimized
in the following ways. Clusters are stored in sequential locations rather
than in a 2-dimensional array. Only sufficient core storage for the
clusters as a group need be allotted regardless of the variation in cluster size.
A linked list could also be used. However, if secondary storage has to be
used to store part of the cluster information, sequential storage is then
preferable.

Sequential storage requires moving the cluster information in order
to insert new concept-weight pairs. To minimize moves, two consecutive input
vectors are kept in core. In the course of one stage of the algorithm, the
input to the previous stage is added to the appropriate clusters and
correlation of the current document is made simultaneously. The entire cluster
collection is moved at most once for each stage.

6. Results

This study employs two document collections to evaluate the
performance of the single-pass algorithm, the ADI collection of 82 documents and
35 queries and the Cranfield collection of 424 documents and 155 queries.
Evaluation is made by comparing the results using the present algorithm with
the results using Dattola's algorithm. Several clustering and search runs*
were made using similar input parameters in both algorithms. Full search
results on the ADI collection are also shown for comparison.

*In the recall-precision curves presented, the modifications proposed by
Dattola, [4] pp. 16-24, to compensate for variations in correlation
percentage and uniform distribution of unrecovered relevant documents are
incorporated.

The cost of clustering is first discussed in Section 6.A. Several
clustering runs were made using the single-pass algorithm over different
orderings of the ADI collection to show the effects of the order in which
documents are processed. This is discussed in Section 6.B.

Clustering is also done at two levels as well as one level to
illustrate savings in clustering time and to show the effect of multilevel
clustering on cluster quality. The results of searching the ADI collection,
including multilevel search, are discussed in Section 6.C. Finally, the
algorithm is applied to the Cranfield 424 collection and the results are
discussed in Section 6.D.



A) Clustering Costs 

Cost comparisons between clustering methods can be made according 
to several criteria. The two for which results are presented in this 
study are: 

1. Number of vector comparisons performed. 

2. Computer resources used, mainly CPU and I/O time. 




The first criterion allows comparison of algorithms to a large 
degree independently of the programming techniques employed and the system 
in which the programs are embedded. The second reflects the effects of 
the system and the programming techniques as well. 

Consider the number of vector comparisons. It is convenient to
assume that clusters are formed at uniform intervals during the processing
of the collection, that is, cluster 1 forms with document 1 and in general
cluster i forms with document (N/m)(i-1) + 1 and so forth, where there are
N documents and m clusters produced. Under this assumption, the number
of comparisons for clustering on a single level is given as follows:

$$C_1 = \frac{N-1}{m} \sum_{i=1}^{m} i = \frac{(N-1)(m+1)}{2}$$

where

C_1 = number of comparisons on level 1
N = number of documents in the collection
m = number of clusters produced
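The closed form above can be checked against a direct count. The following is a minimal sketch, not the program used in the study; it assumes that every document, including one that goes on to start a new cluster, is first compared with each cluster already in existence. The function names are illustrative only.

```python
def comparisons_simulated(n_docs, n_clusters):
    """Count document-to-centroid comparisons when cluster i is started by
    document floor((N/m)(i-1)) + 1 (the uniform-formation assumption)."""
    forming_docs = {(n_docs * (i - 1)) // n_clusters + 1
                    for i in range(1, n_clusters + 1)}
    existing = 0   # clusters formed so far
    total = 0      # comparisons made so far
    for doc in range(1, n_docs + 1):
        total += existing          # compare with every existing centroid
        if doc in forming_docs:
            existing += 1          # this document starts a new cluster
    return total


def comparisons_closed_form(n_docs, n_clusters):
    """C_1 = (N - 1)(m + 1) / 2, as given in the text."""
    return (n_docs - 1) * (n_clusters + 1) / 2


if __name__ == "__main__":
    # For N = 80 and m = 8 the direct count is 352 and the formula gives 355.5.
    print(comparisons_simulated(80, 8), comparisons_closed_form(80, 8))
```

The small difference arises because the formula treats the cluster intervals as continuous, while the count works document by document.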



If clustering is done on x levels and p clusters are formed at
each level from each cluster on the next higher level, the number of
comparisons made at level i is

$$C_i = \frac{(N-1)(p+1)}{2}$$

Consequently, for multilevel clustering over x levels, the total number of
comparisons C is

$$C = \sum_{i=1}^{x} C_i = \frac{(N-1)(p+1)}{2}\, x$$

and since m = p^x,

$$C = \frac{(N-1)(p+1)}{2} \log_p m$$


Dattola gives a similar derivation in complete detail [4], giving
the total number of comparisons D over x levels as defined above:

$$D = k N p \log_p m$$

where

k = the average number of times a set of documents is compared
to a set of cluster centroids.

A typical value of k is 16 for a collection the size of the Cranfield 424.

Based on vector comparisons, then, one could expect an increase in
speed over Dattola's algorithm of 2k for large p and (3/2)k for p = 3
(the optimum value proposed by Dattola [4]). A somewhat lower ratio is
shown in the results of this study. Table 1 shows comparative results.
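The factors 2k and (3/2)k follow from dividing D by C: the logarithm cancels and the ratio is 2kp/(p+1) multiplied by N/(N-1). A minimal sketch of that arithmetic is given here; the function names are illustrative and the formulas are simply those stated in the text.

```python
import math


def single_pass_comparisons(n_docs, p, m):
    """C = (N - 1)(p + 1)/2 * log_p(m), the multilevel single pass estimate."""
    return (n_docs - 1) * (p + 1) / 2 * math.log(m, p)


def dattola_comparisons(n_docs, p, m, k):
    """D = k * N * p * log_p(m), the estimate attributed to Dattola [4]."""
    return k * n_docs * p * math.log(m, p)


def expected_speedup(n_docs, p, k):
    """D / C. The log_p(m) factor cancels, leaving 2kp/(p+1) * N/(N-1)."""
    return 2 * k * p / (p + 1) * n_docs / (n_docs - 1)


if __name__ == "__main__":
    # p = 3 gives roughly (3/2)k; a very large p approaches 2k.
    print(expected_speedup(424, 3, 16))
    print(expected_speedup(10**6, 10**6, 16))
```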

For the ADI collection the ratio of number of comparisons is 15:1, implying
a k for Dattola's algorithm of 7.5. For the Cranfield 424 the values are
8.2:1 and 5.25, respectively. The difference from the value k = 16, which
is expected, is largely explained by the following factors:

1. In the runs using the present algorithm, a burst of
clusters was forced to form at the outset. Thus the uniform
formation assumption is not met and more comparisons are
made with the present algorithm than predicted. In the
limiting case where the first m documents form the m
clusters, C is bounded as follows:

$$C < (N-1)\, p \log_p m$$

2. During an iteration of Dattola's algorithm the number of
trial centroids is frequently less than the chosen value
of p.

3. Iterations of Dattola's algorithm sometimes converge before
the iteration limit is reached.

4. Two levels of clustering were performed on the Cranfield
424 under Dattola's algorithm against a single level for
the algorithm of this study.

                                            Present    Dattola's
                                            Study      Algorithm

ADI Collection (82 documents)
  Number of clustering runs averaged        11         1
  Vector comparisons                        445        6614
  CPU seconds (360/65)                      5.8        44.0
  I/O seconds (360/65)                      13.3       38.0

Cranfield 424 Collection* (424 documents)
  Number of clustering runs averaged        1          1
  Vector comparisons                        5579       45840
  CPU seconds (360/65)                      214        439
  I/O seconds (360/65)                      126        653

Clustering Cost Comparisons Between the Present
Study and Dattola's Algorithm

Table 1

*Results shown for Dattola are for two-level clustering.
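The ratios quoted in the text can be recomputed from the figures in Table 1. The sketch below only restates that arithmetic; the dictionary layout and the function name are illustrative.

```python
# Figures copied from Table 1 as (present study, Dattola's algorithm).
TABLE_1 = {
    "ADI": {
        "vector comparisons": (445, 6614),
        "CPU seconds": (5.8, 44.0),
        "I/O seconds": (13.3, 38.0),
    },
    "Cranfield 424": {
        "vector comparisons": (5579, 45840),
        "CPU seconds": (214, 439),
        "I/O seconds": (126, 653),
    },
}


def cost_ratios(rows):
    """Return Dattola-to-present-study ratios for each cost measure."""
    return {name: dattola / present
            for name, (present, dattola) in rows.items()}


if __name__ == "__main__":
    for collection, rows in TABLE_1.items():
        print(collection, cost_ratios(rows))
```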

The comparison of CPU times in Table 1 is somewhat less favorable
to the algorithm of this study, particularly on the Cranfield 424 collection.
The major factor is scoring, or vector comparison. Scoring may be done over
an array of weights if concept numbers are restricted to a prescribed range.
In the case of Dattola's implementation, concept numbers do not exceed 10,000,
so comparison may be done over a 10,000-element array in which the concept
number is given by position rather than in a list where the concept numbers
appear explicitly. The present algorithm uses such a list, ordered by
concept number, and runs more slowly as a consequence.
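The two scoring layouts described above can be sketched as follows. This is an illustration only and not either implementation: it uses a plain inner product for brevity (the correlation measure of the study is not restated here), and all names are hypothetical. The positional array needs no concept-number tests, while the ordered list requires a merge with a comparison at every step.

```python
def score_dense(a, b):
    """Inner product of two weight arrays indexed by concept number."""
    return sum(x * y for x, y in zip(a, b))


def score_sorted_lists(a, b):
    """Inner product of two lists of (concept_number, weight) pairs, each
    ordered by concept number, using a two-pointer merge."""
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        concept_a, weight_a = a[i]
        concept_b, weight_b = b[j]
        if concept_a == concept_b:
            total += weight_a * weight_b
            i += 1
            j += 1
        elif concept_a < concept_b:
            i += 1
        else:
            j += 1
    return total


def to_dense(pairs, size):
    """Expand a (concept_number, weight) list into a positional array."""
    dense = [0] * size
    for concept, weight in pairs:
        dense[concept] = weight
    return dense


if __name__ == "__main__":
    doc = [(3, 2), (17, 1), (250, 4)]
    query = [(17, 3), (250, 1), (9000, 5)]
    print(score_sorted_lists(doc, query))
    print(score_dense(to_dense(doc, 10001), to_dense(query, 10001)))
```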

Perhaps the most important point to be made in this discussion is 
that the algorithms of Dattola and of the present study are of the same 
order. The constant multiplier k , however, is of the order 16 in the 
case of Dattola and 1 in the case of the present study. 

B) Effect of Document Ordering 

The effect of ordering is studied by comparing the search results
on the original ADI collection and three permutations of it. The three
permutations are constructed by reordering the collection according to tables
of random numbers between 0 and 99.

Clusters are generated using NCEIL = 22, NCLREQ = 9 and OVLAP = .122
(default options are used for the remaining parameters). A minimum of 10
documents is searched for each query. Plots of precision vs. recall are
shown in Figures 3 and 4. As expected, there is significant variation in
the performance of the algorithm for different orderings of the documents.

It is interesting to note that the original ordering of the ADI collection
gives the worst results. The third permutation gives the best results, which
actually exceed Dattola's results except in the high recall region. It can
be seen that the relative improvement of the results by reordering is better
in the document-level average plots than in the recall-level average plots.

Figures 5 and 6 show the spread between the curves representing 
the original ordering and the third permutation of the documents of the ADI 
collection. The results can be compared with those of a full search and 
clustered search using Dattola’s algorithm. 

Multilevel clustering is discussed in Section 6.C. Since multilevel
clustering improves search results, it is possible that document
reordering may be used to obtain further improvement in this case, although it
would be difficult to determine a suitable ordering in advance.

C) Search Results on Clustered ADI Collection 

Recall and precision plots of searches on the clustered ADI collection
are shown in Figures 5 and 6. As discussed in Section 6.B, it may be
seen that there is a substantial variation in the quality of the clusters
over different orderings of the collection, as measured by search results.

In comparison with both the full search and Dattola’s algorithm, the present 
algorithm shows a tendency to perform best in the low-recall region. This 
effect may be observed in all results of this study and, consequently, it 
is a distinguishing characteristic of the single pass algorithm. 

It should be observed that search costs for the results shown using
the present algorithm and the results using Dattola's clusters are roughly
similar. For example, 990 vector comparisons are made in searching the
clusters produced from the ADI collection as originally ordered compared with
966 vector comparisons in searching Dattola's clusters.

Effect of Ordering on Document Level Averages in Search
of ADI Collection Clustered with Single Pass Algorithm

Fig. 3

Effect of Ordering on Recall Level Averages in Search
of ADI Collection Clustered with Single Pass Algorithm
(curves: Original Ordering, First Permutation, Second Permutation,
Third Permutation)

Fig. 4

Document Level Averages for Full Search, Dattola and Single-Pass
Clustered Search on ADI Collection
(curves: Full Search, Dattola's Clusters, Single Pass with Original
Ordering, Single Pass with Best Random Ordering)

Fig. 5

Recall Level Averages for Full Search, Dattola and Single-Pass
Clustered Search on ADI Collection
(curves: Full Search, Dattola's Clusters, Single Pass with Original
Ordering, Single Pass with Best Random Ordering)

Fig. 6

In Section 6.A an expression for the cost of clustering is given.
The relationship is such that clustering cost is reduced both for the
algorithm of this study and for Dattola's algorithm if multilevel clustering is
done.* It is not known how the search results on clusters formed at a single
level and over multiple levels compare in the case of Dattola's algorithm.
However, in the case of the present algorithm, multilevel clustering is not
only less expensive to perform but it can produce markedly better clusters.

Figures 7 and 8 show the improvement in recall and precision achieved
when the same ordering of the ADI collection is clustered over one level and
two levels. Clusters are formed at the first level and then sons of each
are formed at the second level. The ordering chosen was the one among the
four tested for which recall and precision are poorest for the single level
clusters. It is suspected that improvement would also be shown for the other
orderings.

It can be noted in Figures 7 and 8 that the two-level clustered
search is markedly better than the one-level search, particularly in the
low recall region. Moreover, the two-level search performs better than the
search on Dattola's clusters in the low recall region and approaches the
full-search curve in this region.



*It is thought proper to apply the description "single pass algorithm" to
multilevel clustering where (a) each level is clustered in a single pass
and (b) the multilevel algorithm performs fewer comparisons than a single
pass would perform as a single level.

Document Level Averages for Full Search, Dattola, One-Level
and Two-Level Clustered Search of ADI Collection

Fig. 7

In general, the probability that a document late in the input enters
an early cluster is related to the number of clusters formed in the pass over
a level of the cluster tree. The greater the number of clusters, the more
the membership of early clusters is confined to documents early in the
collection. This can be seen by observing that, except for the case where a
collection is partitioned into m sequential clusters, a cluster on the average
allows documents to be admitted over a sequence of input documents larger than
the cluster size, that is, for a single level of clustering.

where

n = number of documents for which a substantial probability of
admission to a certain cluster exists.

a = a constant (a > 1).

If clustering is done over two levels.

The above suggests that producing as few as two clusters from
each father in the cluster tree would allow the best associations to 
be made. It is also suspected that better results may be obtained if the 
order in which the documents are compared at each level is reversed. The 
rationale here is that the last documents admitted to the cluster have 
in general the highest correlation with the centroid. It is hypothesized 
that in a pass in reverse order, these documents will tend to form a single 
cluster and allow the earlier ones to fall in separate groupings. 

The search results reported above suggest that sets of clusters 
with similar size statistics have similar search costs when the same search 
parameters are used. Comparative figures are shown in Table 2. 

Overlap figures are also given in Table 2. It may be seen that 
overlap varies substantially between the sets of clusters compared. Cer- 
tainly, increased overlap increases search costs since it increases average 
cluster size. Whether increased overlap affects recall and precision 
curves to any great extent is less clear. It may be argued that the clus- 
tering algorithm operates without knowledge of the query set subsequently 
used to search the collection and, consequently, assignment of documents to 
multiple clusters is independent of relevance judgments. Unpublished results 
of Dattola in which overlap was varied widely when clustering both the ADI 
and Cranfield 200 collections without apparent effect on recall and preci- 
sion support the hypothesis of independence. 

The results of this study as well suggest that overlap is uncorrelated
with recall and precision. Overlap is not held constant in the results
presented because of the difficulty in matching the overlap measures
of Dattola's algorithm and of the algorithm of this study. As may be seen
in Table 2, higher overlap values occurred with the better results on the
ADI collection, but the reverse is true for the two runs made on the
Cranfield 424 collection.

D) Search Results of Clustered Cranfield Collection 

The Cranfield 424 collection is known to have sequential groupings 
of some of the documents relevant to certain queries. Consequently, a ran- 
dom ordering of the collection is used as input to the single level clustering 
run performed with the algorithm of this study. Figures 9 and 10 show the 
recall and precision curves from a search on this set of clusters compared 
with a search on a set of clusters produced with Dattola's algorithm. In
the case of Dattola's algorithm [4] clustering is done over two levels
using clustering parameters which were found to be optimal on the ADI and
Cranfield 200 collections.

As may be seen from Figures 9 and 10, the algorithm of this study
produced a slightly better recall and precision curve than Dattola's.

Additional runs on several permutations of the collection are needed to 
establish whether a significant difference is shown consistently. Search 
costs on the Cranfield 424 collection using the single pass algorithm are 
comparable to search costs using Dattola's clusters. As shown in Table 2, 
the present algorithm requires 13,103 vector comparisons, compared with 
12,200 for Dattola's case. A more useful measure, which allows comparison 
of collections of different size, is the fractional search cost, defined as 
the number of vector comparisons per document per query. Fractional search 
costs on the Cranfield collections for the single pass method and for Dattola's 
are roughly similar, as seen in Table 2. 
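The fractional search cost defined above is a simple normalization. The sketch below assumes that the comparison counts quoted in the text are totals over all queries of a collection (155 queries for the Cranfield 424, 35 for the ADI); the text does not state this explicitly, and the values of Table 2 are not reproduced here, so the printed figures are illustrative only.

```python
def fractional_search_cost(vector_comparisons, n_documents, n_queries):
    """Vector comparisons per document per query."""
    return vector_comparisons / (n_documents * n_queries)


if __name__ == "__main__":
    # Assumption: the counts are totals over the whole query set.
    print(fractional_search_cost(13103, 424, 155))  # single pass, Cranfield 424
    print(fractional_search_cost(12200, 424, 155))  # Dattola, Cranfield 424
    print(fractional_search_cost(990, 82, 35))      # single pass, ADI
    print(fractional_search_cost(966, 82, 35))      # Dattola, ADI
```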
7. Conclusions

The single pass algorithm of this study is substantially less costly
to execute than other clustering algorithms, Dattola's in particular. As
may be expected, however, the quality of the cluster set depends to a large
degree on the order in which the documents are encountered by the algorithm.
On the ADI collection, some orderings produce search results, measured by
recall and precision curves, better than Dattola's clusters in the low recall
range. Other orderings are worse at virtually all points on the curve.

A single clustering run is reported for a larger collection, the
Cranfield 424. It indicates that cluster quality is not degraded when the
algorithm is applied to a collection about five times as large as the ADI.

Multilevel clustering using the basic single pass approach of this
study is shown both to be cheaper and to produce substantially better clusters
as measured by search results. There is a strong suggestion that further
work could establish the multilevel single pass algorithm to be as good as or
better than Dattola's algorithm for most orderings of the collection.

The basic limitations of the single pass method appear to be overcome
best when multilevel clusters form a binary tree. It is possible that the
contents of a collection would dictate nodes in the tree of higher degree.
The trade-offs involved in such cases should be investigated.

The multilevel clustering of this study is confined to presenting
only the lowest level of the tree to the search algorithm. However, the
entire tree could easily be made available to the search algorithm. If so, two
possible ways of construction present themselves. The first is to fetch each
document description from the collection only once. Each document would be
passed down the tree and all decisions relative to it would be made in sequence,
level by level. In effect, many levels of single pass clustering would be
carried on simultaneously, each cluster at each level being treated as the
collection from which its sons are formed. In the case of a binary tree,
for example, the first document in the collection would be passed down the
entire leftmost branch of the tree to the predetermined lowest level. It
would define, at first, the leftmost cluster at each level. The next
document would then be passed down the tree. As long as it was admitted to
existing clusters it would add to their definition. At the point (if any)
where it was selected to define its own cluster, it would form a right
branch and then a sequence of left branches down to the lowest level.
Subsequent documents would be processed similarly.
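The first method of construction might be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the study: admission is decided by a cosine correlation against a fixed threshold, a document enters exactly one son at each level (no overlap), and a document that is not admitted anywhere is forced into the most similar son once the permitted number of sons exists. The names `Node`, `insert`, `build_tree`, `fanout` and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine correlation of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Node:
    def __init__(self, dim):
        self.centroid = [0.0] * dim   # running sum of member vectors
        self.children = []
        self.members = []             # document ids admitted to this cluster

    def add(self, doc_id, vec):
        self.members.append(doc_id)
        self.centroid = [c + v for c, v in zip(self.centroid, vec)]


def insert(node, doc_id, vec, depth, max_depth, fanout, threshold):
    """Pass one document down the tree, deciding at each level whether it
    joins an existing son or defines a new one."""
    node.add(doc_id, vec)
    if depth == max_depth:
        return
    best = max(node.children, key=lambda c: cosine(c.centroid, vec),
               default=None)
    admitted = best is not None and cosine(best.centroid, vec) >= threshold
    if not admitted and len(node.children) < fanout:
        best = Node(len(vec))         # the document defines its own cluster
        node.children.append(best)
    insert(best, doc_id, vec, depth + 1, max_depth, fanout, threshold)


def build_tree(docs, max_depth=2, fanout=2, threshold=0.8):
    """Fetch each document once and build all levels simultaneously."""
    root = Node(len(docs[0]))
    for doc_id, vec in enumerate(docs):
        insert(root, doc_id, vec, 0, max_depth, fanout, threshold)
    return root


if __name__ == "__main__":
    tree = build_tree([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])
    print([son.members for son in tree.children])
```

With `fanout=2` the result is a binary tree; a new son created at one level is followed by a chain of first sons down to the lowest level, as the text describes.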

The second method of construction is to form each level completely 
before the next level is begun. Document descriptions would have to be 
fetched from the collection once for each level plus additional occurrences 
caused by overlap. However, cluster quality might well be better since the 
direction of passing over the documents can be reversed at each level. 

Even when just two sons are formed from each father in the tree, 
there still exists the possibility that natural clusters will be split 
into fragments. By the nature of the process, once two documents are 
separated, they cannot be associated again. To a certain degree, searching 
over multiple clusters allows these documents to be found. It would be best, 
however, to have them properly associated in the tree. It is proposed, 
therefore, that the leaves of a completed tree be compared one to another. 
Those with particularly high correlations would be coalesced into a single 
cluster taken to be the son of both fathers. Such a coalescing process 
would deform the tree only at the lowest level and could be expected to 
reassociate sets of documents which were of roughly equal size and large 
enough to be a majority of the members of the clusters involved. Any
further investigation will necessarily include experiments designed to
strengthen, if possible, the results already found and to consider further
the selection of optimum clustering parameters.


A Systematic Study of Query-Clustering Techniques:
A Progress Report

S. Worona


Abstract 

An experiment using various techniques of query clustering on the 
Cranfield 424 document collection is described and some preliminary results 
are given. Several methods of evaluating the performance of clustered 
searches in the context of query-clustering are discussed. Finally, some 
observations are made concerning use of the SMART system as implemented 
at Cornell University. 

1. Introduction

The idea of query-clustering as an aid to information retrieval
systems is first defined and examined by V. R. Lesser [1]. In that report,
a two-level clustering algorithm is described in which members of a
document collection are assigned to clusters according to their relationship
with previously-formed clusters of queries.

It is argued that there are three advantages to performing the
clustering in this manner. First, the accuracy of a given search procedure
may be increased by comparing queries to sets of related queries already
processed by the retrieval system, instead of sets of related documents.
Second, it is likely that such a system will perform better as time
passes and more queries are available for clustering. Finally, because
the cost of most clustering algorithms increases with the size of the
collection being clustered, it is more economical to use a query collection
than the associated, and much larger, document collection. A more
thorough general discussion of the first two of these points can be found in
[2], as well as in the original paper by Lesser. Applications of these ideas
to various methods of query clustering are discussed below. (For additional
information on clustering algorithms, see [3, 4, 5, 6, 7, 8]. Background material
and further references may be found in [9].)

The general process of query clustering may be divided into three
parts, or "phases":

1) generation of query clusters;

2) generation of document clusters from the query clusters of
phase 1; and

3) definition and assignment of centroids for the phase 2 document
clusters.

Each of these three phases may be accomplished in several different ways.
A combination of three such methods (that is, one for each phase) is termed
an "implementation" of query clustering, or a particular query-clustering
technique.

Many procedures exist for performing phase 1, that is, the initial
clustering job. The variables in using these algorithms include the number
of clusters desired, the amount of overlap permitted, and the number of queries
to be clustered. The last parameter is particularly important to a
query-clustering technique, because it is hoped that search results improve with
an increase in the number of queries clustered.

Phase 2 may be implemented in any of the following three ways:

1) Relevancy decisions

2) Correlation with query centroids

3) Correlations with clustered queries.

In the first case, documents which are relevant to one or more queries
in any one query cluster are grouped. For case 2, all documents are
correlated against each query centroid, and those documents correlating
highly with any such centroid are put into one cluster. In the third
case, the documents are correlated with all queries used in clustering,
and a document cluster is formed from those correlating highly with one
or more queries in any given query cluster. In all three methods, one
document cluster is formed for each query cluster generated in phase 1.
The query cluster from which such a document cluster is formed (whether
by relevancy decisions, centroid correlations, or query correlations)
is called its underlying query-cluster.
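The first of the three methods, grouping by relevancy decisions, can be sketched in a few lines. This is an illustration only; the function name and the data layout (query clusters as lists of query numbers, relevance judgments as a mapping from query number to document numbers) are assumptions made for the example.

```python
def document_clusters_by_relevance(query_clusters, relevant_docs):
    """Form one document cluster per query cluster: the union of the
    documents judged relevant to any query in that cluster."""
    clusters = []
    for queries in query_clusters:
        docs = set()
        for query in queries:
            docs |= set(relevant_docs.get(query, ()))
        clusters.append(docs)
    return clusters


if __name__ == "__main__":
    query_clusters = [[1, 2], [3]]
    relevant_docs = {1: [10, 11], 2: [11, 12], 3: [12, 13]}
    print(document_clusters_by_relevance(query_clusters, relevant_docs))
```

In the small example, document 12 is relevant to queries in both query clusters and therefore appears in both document clusters, which shows how overlap can arise in phase 2 even when the query clusters themselves do not overlap.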

There is much disagreement about the manner in which a centroid
is chosen to represent the documents in a cluster. Concepts may be
logical or weighted, with very high or very low weights arbitrarily either
retained or dropped by one of several methods [10]. In query clustering,
however, the choice of the type of centroids used (phase 3) is much more
basic: the centroid for each phase 2 cluster may be either the document
centroid formed from the documents of the cluster, or the query centroid
of the underlying query-cluster.

There are obviously a large number of query-clustering techniques
which may be formed from different combinations of the above variations
of the three phases. At this time, the only available studies of query
clustering are focused on varying phase 1 methods, while using relevancy
decisions for phase 2 and query centroids in phase 3. This paper deals
with an experiment which considers all six combinations of second- and
third-phase variations, while also using different numbers of queries in
phase 1. This produces eighteen different query-clustering techniques,
which are then compared to a standard full search, and also to a set of
"normally-generated" clusters (formed using Dattola's Algorithm).

Because of the large volume of data generated by the experiment, a
thorough analysis of the results is not yet available. The present report
will be followed by such an analysis, using the parameters developed in
[2].
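The count of eighteen techniques mentioned above is the product of three phase 2 methods, two phase 3 centroid types, and three numbers of clustered queries (the cluster-set sizes of 75, 100, and 125 queries used in the experiment). A short sketch of the enumeration follows; the labels are illustrative.

```python
from itertools import product

PHASE_2_METHODS = (
    "relevancy decisions",
    "correlation with query centroids",
    "correlation with clustered queries",
)
PHASE_3_CENTROIDS = ("document centroid", "query centroid")
CLUSTER_SET_SIZES = (75, 100, 125)


def enumerate_techniques():
    """List every (phase 2 method, phase 3 centroid, cluster-set size)."""
    return list(product(PHASE_2_METHODS, PHASE_3_CENTROIDS, CLUSTER_SET_SIZES))


if __name__ == "__main__":
    for technique in enumerate_techniques():
        print(technique)
```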



2. The Experiment 



A) Splitting the Collection 

In all experiments with query clustering, it is first necessary
to divide the query collection used into two disjoint subcollections: a
cluster-set, and a test-set. (The collection used in this case is the
155-query set associated with the Cranfield 424 documents.) In general,
the cluster-set provides the queries for clustering; these clusters are then
used to generate phase 2 document clusters. When a tree (that is, a
hierarchy of documents and centroids) has thus been formed, the test-set
queries are used to determine the performance of the particular method used.
This simulates the action of an actual information-retrieval system, and
makes clear the requirement that the two query-sets be disjoint. (Since
one of the hypotheses being tested states that new queries entering a
system benefit by the presence of similar queries already processed, it
would be unrealistic to test methods of query clustering which allow the
"new" queries to be present in the system already.)

Because of the nature of the query collection used, another
consideration in this splitting is the authorship of the queries.
In a real system, it is unlikely that a given author would submit
two queries with similar sets of relevant documents. (If this were
the case, relevance feedback from the results of the previous query
should best be used in handling the later query.) The Cranfield
queries, however, contain many instances of authors' submitting several
queries with similar sets of relevant documents. It is not unreasonable,
then, that in splitting such a test collection into two sets of queries,
it should be required that author-sets not be broken. That is, if any
query of a given author appears in one of the sets, then all queries
of that author are put into that set.

With these considerations in mind, the 155 queries are split
into two author-consistent sets of 30 (test-set) and 125 (cluster-set)
queries each. This is done by generating random numbers between 1
and 155, and including in the test-set those queries whose numbers are
drawn at random, as well as all other queries by the same author.
Random numbers corresponding to queries already selected are passed
by in subsequent drawings. Appendix A, Table A1 gives the results of
this splitting, including author number for all queries selected.
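The splitting procedure can be sketched as follows. The stopping rule (draw until the test-set holds at least the desired number of queries) is an assumption, since the text does not say how the drawing was ended; the function name, the seed, and the author table are hypothetical, and the actual authorship data are those of Appendix A.

```python
import random


def author_consistent_split(authors, test_size, seed=0):
    """authors maps query number -> author id. Queries are drawn at random;
    each draw moves the drawn query and every other query by the same author
    into the test-set. The remaining queries form the cluster-set."""
    rng = random.Random(seed)
    by_author = {}
    for query, author in authors.items():
        by_author.setdefault(author, set()).add(query)
    queries = sorted(authors)
    test_set = set()
    while len(test_set) < test_size:
        drawn = rng.choice(queries)
        if drawn in test_set:
            continue                  # already selected: pass it by
        test_set |= by_author[authors[drawn]]
    cluster_set = set(queries) - test_set
    return test_set, cluster_set


if __name__ == "__main__":
    # Hypothetical authorship: 31 authors with 5 queries each.
    authors = {q: (q - 1) // 5 for q in range(1, 156)}
    test_set, cluster_set = author_consistent_split(authors, 30)
    print(len(test_set), len(cluster_set))
```

Because whole author-sets are moved at once, the test-set may exceed the requested size when the author-sets are of unequal size.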

After the generation of the test-set, the cluster-set is
formed from the remaining queries. In order to allow the experiment
to test the effect of enlarging the base of clustered queries, the
125-item cluster-set is subdivided randomly into sets of 75 and 100
queries, such that the first is a subset of the second. Three cluster-sets,
called CS1, CS2, and CS3, are thus formed, with 75, 100, and 125 queries,
respectively, where the increase in size from one to the next is caused only
by the inclusion of additional queries. This, too, mirrors the action of a
real system, where an increase in queries processed is caused by addition,
rather than by reformulation. Table A2 in Appendix A lists the queries
included in these three cluster-sets.
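One simple way to obtain nested cluster-sets of this kind is to shuffle the cluster-set once and take prefixes of the shuffled order. The sketch below is illustrative; the function name and seed are assumptions, and the actual membership of CS1, CS2, and CS3 is that listed in Table A2.

```python
import random


def nested_cluster_sets(cluster_set, sizes=(75, 100, 125), seed=0):
    """Return nested subsets of the given sizes, each containing the
    previous one, by taking prefixes of a single random ordering."""
    order = sorted(cluster_set)
    random.Random(seed).shuffle(order)
    return [set(order[:size]) for size in sizes]


if __name__ == "__main__":
    cs1, cs2, cs3 = nested_cluster_sets(range(1, 126))
    print(len(cs1), len(cs2), len(cs3))
```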

B) Phase 1: Clustering the Queries 

The three sets of cluster-queries formed by splitting the Cranfield
424 collection queries are clustered using Dattola's algorithm (see [7]).
Essential to this algorithm is a specification of the number of clusters
desired and the amount of overlap permitted. The results of the experiment
in [2] indicate that overlap in such a set of query clusters is greatly
magnified when the related document collection is formed, particularly when
relevance decisions are used in phase 2. It is apparent, moreover, that
overlap will also be increased by most other implementations of phase 2.
Since the overlap obtained in [2] was far too great, it was decided that the
query clusters formed here should have no overlap.

Not so easy to answer is the question of how many clusters should
be formed. This is a problem in any one-level query-clustering technique.
First, as many query-clusters must be formed as the number of desired
document clusters. Furthermore, the number of queries to be clustered is
generally far less than the number of documents. Thus, the number of clusters
cannot be optimal for both the queries and the documents. According to [8],
the best number of equal-sized clusters which can be formed from n items is
of the order of $\sqrt{n}$. Thus, for the 424 documents of the Cranfield
collection, approximately 21 clusters should be made, while the three
cluster-sets of queries require, roughly, 9, 10, and 12 clusters,
respectively. One solution to the problem is to abandon the single-level
hierarchy in favor of a multiple-level tree, where the clusters on
level 1 are formed by breaking up the query-clustering-generated clusters
on level 2 (see Figure 1). This method is currently under investigation
by Magliozzo and Bodenstein [11].

Because this experiment is not designed to consider such variations
in query clustering, a solution other than that mentioned above is
desirable. A compromise, therefore, is made between the different optimal
numbers of clusters required by the four collections. Dattola's algorithm
is asked to provide 15 clusters for each set of queries and documents to be
clustered. The 15 clusters of the 424 documents thus generated are later
used as a "standard clustering" against which the query-clusterings are
compared. (The actual generation of so precise a number of clusters is
not a trivial matter, as is discussed in Appendix C.) The results of
using Dattola's algorithm to cluster these collections are given in
Appendix A, Tables A3 and A4.

C) Phase 2: Clustering the Documents


In general, the most convenient way to implement phase 2 consists
in using relevancy decisions. This is done by assigning each document to
the cluster or clusters whose underlying query cluster(s) contain(s) one
or more queries to which it is relevant. Unfortunately, this process deals
with only part of the documents in the collection, in most cases. Although
each document in the Cranfield 424 collection is relevant to one or more
queries in that collection, not all of these queries are present in any of
the cluster-sets CS1, CS2, and CS3. The documents not assigned to clusters
by analysis of relevance may be called "loose documents" (see [8] for a full
description of the term), and must be "blended" into the clusters already
formed. In all three cases the number of loose documents is rather
substantial: 135 in CS1, 89 in CS2, and 54 in CS3. (It should be noted here that,
because of the relationship between these three cluster-sets, the loose
documents of CS3 are a subset of those of CS2, which are, in turn, a subset
of those of CS1.)

Because of the large numbers of loose documents, the method used to
incorporate such documents into particular clusters is quite important, and
must be the subject of careful scrutiny in any actual use of this method.
For the present experiment, however, where the major object is the
examination of the results of varying aspects of the clustering scheme other than
the blending method, an arbitrary way of assigning loose documents to clusters
is chosen. The method consists in correlating each loose document with the
15 centroids of the given cluster-set, and adding each document to the document
cluster(s) for which the centroid of the underlying query cluster(s) satisfies
one of the following conditions: either the correlation between query
centroid and document is 0.250 or higher, or the centroid-document correlation
is higher than the correlation with all other query centroids. The reason
for this method will become clear when method 2b is discussed below.
Results of this assignment (relevancy combined with blending) are given
in Appendix A, Table A5.

It is this type of clustering which has been studied previously,
and which seems most likely to produce improvements over standard algorithms
which do not use query clusters. The effect is to classify documents
according to the questions to which they relate, rather than according to
similarities in word content. It is, moreover, this method which seems likely to
exhibit the most improvement when a larger cluster-base is used. If such
a trend could be confirmed, this type of query-clustering would produce
a way for an information-retrieval system to "learn" from its past successes,
while keeping it from repeating past mistakes. It might also be a way of
implicitly altering a system to compensate for changes in terminology, or
to anticipate the development of new fields of information. For these
reasons, this form of query clustering may have the same type of advantages
as document-space modification, a technique examined in Brauen [12].

Type 2 document clusters are formed according to correlations between
documents and centroids of the clusters formed in phase 1. In some ways,
this might be looked at as a standard clustering algorithm which begins with
certain clusters already formed, and continues by blending into these the
set of documents to be clustered. It might eventually be shown experimentally
that using such centroids as a "seed collection" in any of the standard
algorithms will produce improved performance from the final clusters.

In the experiment at hand, the procedure is the following for
phase 2 using type 2 clusters: all 424 documents are correlated with all
15 query centroids in each cluster set. A document is assigned to any
document cluster for which the centroid of the underlying query cluster
satisfies either of two conditions:

a) the centroid and the document correlate at a level of
0.250 or higher;

b) the correlation with the document is higher than that
achieved by all other centroids.

Note that this is the same condition under which a loose document is blended
into a cluster for method 1. This criterion is chosen, by inspection, to
approximate the size and degree of overlap of what might be considered
"standard clusters". While such an arbitrary cutoff value is likely to be
used in any operational implementation of this method, the cutoff may, of
course, be varied to achieve different clusterings. In Appendix A, Table A6,
it may be noted that the sizes of the clusters formed vary greatly, from a
minimum of 4 by CS1 to a maximum of 85 by CS3. This is clearly an undesirable
result, and a method of avoiding it is suggested below. (Varying the cutoff
might reduce the problem, but would probably not solve it entirely.)
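
The two-part assignment condition (cutoff 0.250, or best-matching centroid) can be sketched as follows. This is a minimal sketch, not the SMART implementation: it assumes the correlation is the cosine between sparse concept vectors, which the text does not state, and the helper names `cosine` and `assign_document` are hypothetical.

```python
import math

CUTOFF = 0.250  # cutoff value quoted in the text


def cosine(u, v):
    """Cosine correlation of two sparse vectors (dict: concept -> weight).

    Assumption: the report's "correlation" is taken to be the cosine measure.
    """
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_document(doc, centroids, cutoff=CUTOFF):
    """Return the ids of every cluster the document joins.

    A cluster is chosen if its centroid correlates at `cutoff` or higher,
    or if it is the best-correlating centroid (ties are all kept here).
    `centroids` must be non-empty.
    """
    scores = {cid: cosine(doc, cen) for cid, cen in centroids.items()}
    best = max(scores.values())
    return {cid for cid, s in scores.items() if s >= cutoff or s == best}


centroids = {"A": {"x": 1.0}, "B": {"y": 1.0}, "C": {"x": 1.0, "y": 1.0}}
strong = assign_document({"x": 1.0, "z": 3.0}, centroids)   # passes the cutoff for A only
weak = assign_document({"x": 0.1, "z": 1.0}, centroids)     # below cutoff everywhere; best is A
broad = assign_document({"x": 1.0, "y": 1.0}, centroids)    # passes the cutoff for all three
```

Because condition (b) always selects at least one cluster, no document is left unassigned under this rule, whatever the cutoff.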

It should be noted here that method 2 is unlike the previous one in
that no loose documents result. This is due, of course, to taking each
document individually and assigning it to one or more clusters. The problem
of non-uniformly-sized clusters may be solved if the generation of loose
documents is permitted. Instead of correlating documents against centroids,
it is possible to reverse the process, matching centroids against documents.
In this variation of method 2, the top n, say, correlants of each centroid
are chosen for inclusion in the document cluster related to that centroid.
Thus, all clusters have the same number of items. By varying the cutoff
(n), and imposing additional restrictions on correlations of included
documents, it seems likely that interesting results could be obtained.
This range of experimentation is not, however, included in the current
study.
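
The reversed variation, in which each centroid picks its top n documents, might look like the sketch below. The score table, the function names, and the tie-breaking by document id are all assumptions made for illustration; the report does not describe an implementation.

```python
def top_n_clusters(doc_scores, n):
    """Build equal-sized clusters from the n best correlants of each centroid.

    doc_scores: dict cluster_id -> dict doc_id -> correlation with that centroid
    Ties are broken by document id (an arbitrary choice for this sketch).
    """
    clusters = {}
    for cid, scores in doc_scores.items():
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        clusters[cid] = set(ranked[:n])
    return clusters


def loose_documents(clusters, all_docs):
    """Documents picked by no centroid; this variation permits them."""
    return set(all_docs) - set().union(*clusters.values())


doc_scores = {
    "A": {1: 0.9, 2: 0.4, 3: 0.1, 4: 0.0},
    "B": {1: 0.8, 2: 0.1, 3: 0.5, 4: 0.05},
}
equal_clusters = top_n_clusters(doc_scores, n=2)
left_over = loose_documents(equal_clusters, [1, 2, 3, 4])
```

Every cluster has exactly n members, at the price of leaving document 4 loose in this toy case.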

Finally, the third form of phase 2 is achieved here by correlating 
all of the 424 documents against the 75, 100, and 125 queries of the three 
cluster-sets. Documents are assigned to any document cluster whose under- 
lying query cluster contains one or more queries such that either a) that 
query correlates more highly with that document than any other query, or 
b) the correlation between the query and the document is 0.250 or more. 

The motivation for this choice of method is the same as that for the type 2
method, and the same comments apply. In this application, also, a
disparity arises in cluster sizes. Table A7 of Appendix A shows that
cluster-set CS1 produces both the largest cluster (111 documents) and
the smallest (3 documents) generated by method 3.

As before, a reverse strategy may be used which would ensure an
even distribution of the documents throughout the clusters (aside from
the blending of loose documents).

This method (in either variation) may be regarded as "pre-searching"
the document collection in order to make later searches more effective.
If, as assumed, many new queries are similar to queries already present in
an information-retrieval system, then such a new query should easily find
the cluster associated with these similar queries. This method of
assigning documents to clusters thus guarantees that the documents in
that cluster are those which correlate highly with the old, similar queries
(and thus, hopefully, with the new query).

Both of these last two implementations of phase 2 are inherently more
expensive than the first. Relevant documents for a given already-processed
query can be easily selected, without the necessity of correlating large 
numbers of concept vectors. In the case of method 2, each document must be 
correlated against as many centroids as there are query clusters. Method 3 
requires a full search of all queries against all documents, although this 
would be done only once for each document and query. For method 2, a new 
correlation by all documents would be required with each update of the query 
clusters. Further analysis of comparative costs of these three methods is 
possible, but beyond the scope of this paper. 
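
As a rough illustration of the cost difference, the number of vector correlations implied by methods 2 and 3 can be counted from the figures given in the text (424 documents, 15 centroids per cluster-set, and 75, 100, and 125 queries). The sketch counts correlations only; it is not a full cost analysis, and the function names are hypothetical.

```python
N_DOCS = 424        # Cranfield 424 documents
N_CENTROIDS = 15    # query centroids per cluster-set
CLUSTER_SETS = {"CS1": 75, "CS2": 100, "CS3": 125}  # queries per cluster-set


def method2_correlations(n_docs, n_centroids):
    """Method 2: every document against every query centroid."""
    return n_docs * n_centroids


def method3_correlations(n_docs, n_queries):
    """Method 3: a full search of all queries against all documents."""
    return n_docs * n_queries


costs = {
    name: (method2_correlations(N_DOCS, N_CENTROIDS),
           method3_correlations(N_DOCS, n_queries))
    for name, n_queries in CLUSTER_SETS.items()
}
```

Method 2 needs the same number of correlations for each cluster-set but must be repeated whenever the query clusters change, whereas the method 3 count grows with the number of queries but each document-query pair is correlated only once.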



D) Phase 3: Assigning Centroids

The two methods of assigning centroids to document clusters are quite
straightforward. In one case, the centroid is taken as that of the underlying
query cluster. In the other, it is the standard centroid defined by the
documents in the cluster.

Note that using document centroids generally requires an additional
series of computations while the use of query centroids does not. This is
the case because, as a rule, the process of query-clustering in phase 1
produces the query centroids as a side-effect, at no additional cost. In
addition, query centroids tend to be small, taking up less storage space within
the machine than document centroids. On the other hand, it may be that
document centroids, which contain more of the information about the documents
they represent, form a better vehicle for combining documents than query
centroids. Even the best clusters will achieve poor performance if the
centroids are poorly defined (see Section 4 of this paper), so that this choice,
also, is crucial.

E) Summary

The diagram of Figure 2 indicates the variations used in forming
18 cluster trees from the Cranfield 424 collection. The cluster-collection
names used in the rest of this paper may be derived from this diagram by
concatenating the three keys describing the particular collection. For
example, the collection formed from CS2 (100 queries) using documents
assigned by method 2 (centroid correlations) and query centroids in phase
3 is '100CCOQC'.
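
The naming scheme can be illustrated by enumerating the 3 x 3 x 2 = 18 combinations. Figure 2 is not reproduced here, so the key spellings below are inferred from the names quoted in the text and should be treated as assumptions, in particular `QCO` for method 3, which is hard to read in the scanned source.

```python
from itertools import product

# Keys inferred from names quoted in the text (assumed spellings).
SIZES = {"075": "CS1 (75 queries)", "100": "CS2 (100 queries)", "125": "CS3 (125 queries)"}
METHODS = {
    "REL": "method 1 (relevancy decisions plus blending)",
    "CCO": "method 2 (centroid correlations)",
    "QCO": "method 3 (query correlations)",
}
CENTROIDS = {"QC": "query centroids", "DC": "document centroids"}

# Concatenate one key from each phase to name a cluster collection.
names = ["".join(keys) for keys in product(SIZES, METHODS, CENTROIDS)]


def describe(name):
    """Split a collection name back into its three keys."""
    return SIZES[name[:3]], METHODS[name[3:6]], CENTROIDS[name[6:]]
```

For instance, `describe("100CCOQC")` recovers the example given in the text: CS2, centroid correlations, query centroids.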



3. Results

For the most part, the results of this experiment are unexpected.
(Graphs of recall-precision values for all 18 clusters are given in Appendix
B, together with graphs for the "standard" clusters and a full search using
the test-set queries.) Consider, for example, the six comparisons which
may be made varying only the number of queries clustered. (The list,
including the names introduced in Figure 2, is given in Table 1 below.)
Intuitively, the expected ranking in all cases is 125-100-075, best to worst.
In only one case out of six, however, is this order achieved: clusters
125QCOQC, 100QCOQC, and 075QCOQC. (See Figure B7.) In four of the five
other cases, the clusters using CS2 (100 queries) performed best. (In the
remaining case, nnnRELQC, CS3 was best, but CS1 was second-best instead of
last.) Moreover, only three of the eighteen cluster-sets produced better
results than the "standard" clusters (clusters 100RELDC in Figure B2 and
100CCODC and 075CCODC in Figure B3). Reference [2] predicts a different
result.

The explanations for these results must await further analysis. At
present, some observations may be given. It must be noted first that the
decisions on the rankings of two or more search results are being made on the
basis only of cursory analysis of the recall-precision graphs presented in
Appendix B. On that basis, for example, clusters 100RELDC (Figure B2) are
being called "better" than the standard clusters of Figure B1, even though
the latter rises above the former from recall level .35 onward. A more
thorough investigation of these graphs is needed to reach firm conclusions
concerning the preferred clustering method.

The possibility of experimental errors must also be considered.
In Appendix C the procedures used in setting up all of these collections are
described. Because of the large amount of handwork that went into this stage
of the experiment, and because of the lack of complete verification of the
resulting lists, it is possible that some degree of error has been
introduced in this way.

The final point to be mentioned here concerns the search strategy
used. In [2], it is necessary to compensate for unusually large clusters
by varying the number of clusters expanded during searching. Since this is
not a problem in the present experiment, a fixed number of clusters is
expanded in all searches. It is possible that this number does not allow
proper representation of the properties of the existing collections. Future
searches will include fixed expansions of different values, along with
expansions which vary with correlation and number of documents searched. This
last consideration seems most promising as an explanation of the results

search, it is possible to isolate the cause of good (or bad) performance to
one of three areas:

1) Cluster generation

2) Centroid definition

3) Search strategy.

When the search is done without clustering, only the third explanation
applies. In addition, the amount of work done in searching a clustered
collection may vary from query to query and from one implementation to
another, and should also be measured. Although such a measure is sometimes
included in the basic performance of a system, it seems clear that the two
areas of analysis are quite separate and should be treated as such.

In general, search results are compared to an "optimal" system:
one which produces all relevant documents for all queries ranked at the
top of the list of retrieved items. In such a system the recall-precision
graph achieves a horizontal line at y-intercept 1 (see Figure 3). In the
same way, it is possible to rank the first-level clusters in any clustered
collection for each query, in order of "desirability". (Note that the
analysis which follows is not directly applicable to multiple-level
clusterings. This is discussed in [10] and [13].)

Given the configuration of Figure 4, the hoped-for ordering of
these clusters before expansion is obviously (high to low number of relevant
documents): D-C-A-B-E. Note that two considerations are involved here:
First, it is required that the clusters containing the greatest number of
relevant documents be ranked highest. Second, between two clusters whose
contents of relevant documents are the same, the smaller should be first.
The definition of the centroid of each cluster determines how high each
cluster ranks, and that definition will determine whether a search does
well (or poorly) because the proper clusters are (or are not) expanded.

Consider now a search which orders the clusters of Figure 4 in their
"optimal" ranking. This search will, in general, perform less well than a
similar one with the clusters of Figure 5, where the clusters are also
considered to be ordered correctly before expansion. The reason is clear: in
the former case it is necessary to examine 135 documents before all relevant
may be included, while with the second group only 80 document correlations
must be made. The cluster-set exhibited in the second set possesses as few
nonrelevant as possible for those clusters containing all relevant. This
property is determined by cluster generation, as apart from centroid
generation and search strategy.

Three parameters are introduced in [2] for dealing with these concepts.
They are "aim", "target", and "rejection", defined as follows: Given a query
q with n relevant documents, it is assumed that a clustered document
collection is to be evaluated according to its clustering success and its
achievement in centroid definition. For each number c of clusters expanded,

a) the aim clusters are those c clusters ranked first by
whatever correlation procedure is used in ranking
clusters, and

b) the target clusters are those c clusters ranked first
according to the previously-discussed considerations
of number of relevant contained and size. (For a
more precise description, including the question of
how documents appearing in more than one cluster are
to be treated, see [2].)

and the target value is

$$\text{target} = \frac{\text{number of relevant documents in target clusters}}{n}$$

Similarly, rejection is defined as

$$\text{rejection} = \frac{\text{occurrences of relevant documents in target clusters}}{\text{occurrences of relevant documents in all clusters}}$$

In the measures above, optimal performance is achieved when both the
aim-to-target ratio (self-defining) and the rejection are 1 when averaged over all
queries. This is a restatement of the definitions of [2], in a more precise
form.
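
The target ordering and the two ratios above can be sketched for a single query as follows. This is a simplified illustration, not the procedure of [2]: it ranks clusters by a plain sort (most relevant documents first, smaller cluster first on ties), counts distinct relevant documents for the target value and occurrences for rejection, and uses hypothetical cluster contents rather than those of Figure 4.

```python
def target_clusters(clusters, relevant, c):
    """Pick c target clusters: most relevant documents first, smaller first on ties."""
    def key(cid):
        docs = clusters[cid]
        return (-len(docs & relevant), len(docs), cid)
    return sorted(clusters, key=key)[:c]


def target_value(clusters, relevant, c):
    """Relevant documents found in the target clusters, divided by n = len(relevant)."""
    chosen = target_clusters(clusters, relevant, c)
    found = set().union(*(clusters[cid] & relevant for cid in chosen))
    return len(found) / len(relevant)


def rejection(clusters, relevant, c):
    """Occurrences of relevant documents in target clusters over those in all clusters."""
    chosen = target_clusters(clusters, relevant, c)
    in_target = sum(len(clusters[cid] & relevant) for cid in chosen)
    in_all = sum(len(docs & relevant) for docs in clusters.values())
    return in_target / in_all if in_all else 0.0


# Hypothetical first-level clusters and relevant set for one query.
clusters = {
    "A": {1, 2, 10, 11},
    "B": {3, 12, 13, 14, 15},
    "C": {1, 2, 4},
    "D": {1, 2, 3, 4},
    "E": {20, 21},
}
relevant = {1, 2, 3, 4}
ordering = target_clusters(clusters, relevant, c=5)
```

With these toy clusters a single expansion already reaches every relevant document, yet the rejection is well below 1 because relevant documents also occur in clusters A, B, and C.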

The recall-precision graphs are included in Appendix B, and an
explanation of the programming tasks is given in Appendix C.

"Cran 1 Number" is the query's number in the 424 collection.
"Cran ... Number" is the query's number in the 1400 collection.
"Author" is the author as given by Cranfield.

Collection Name                   CS1    CS2    CS3
No. of queries in collection
Number of clusters formed

*Defined as the number of total occurrences of documents or queries
throughout a clustered collection, divided by the number of distinct
items in the collection.
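
The starred definition above is a simple ratio and can be computed directly. The function name and the toy clusters below are hypothetical; only the ratio itself comes from the footnote.

```python
def occurrences_per_item(clusters):
    """Total occurrences of items across clusters divided by the number of distinct items.

    clusters: dict cluster_id -> set of item ids (documents or queries).
    A value of 1.0 means no item appears in more than one cluster.
    """
    total = sum(len(members) for members in clusters.values())
    distinct = len(set().union(*clusters.values()))
    return total / distinct


toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 1}}
ratio = occurrences_per_item(toy)   # 8 occurrences over 5 distinct items
disjoint = occurrences_per_item({"A": {1, 2}, "B": {3}})
```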
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Appendix C 
The SMART System 



In [14] the basic facts about the SMART system implemented at Cornell
University are given. At the time of its design, however, no large-scale
use of query clustering had been attempted, and none was planned.
Therefore, the experiment described in this report required some ad hoc
programming and some tiresome hand work. It is hoped that future
experimenters with query clustering will benefit from the description given here
of the procedures necessary to work within this implementation of SMART
in order to carry out such work.

No provision exists in SMART for performing the type of random-number
generation and author-set-maintenance described in Section 2. (This
is by no means a deficiency of the implementation since such a program is
of little general use.) A program was therefore written in FORTRAN to take
the information about authors of queries and produce a randomly-selected
set subject to the constraints mentioned previously. Such author information
is readily available.

At the beginning of the experiment no SMART procedure existed for
forming a subcollection of a query or document collection included within
the system. Another program was therefore written (also in FORTRAN) to
subdivide the Cranfield 424 collection's queries into the four subsets
necessary for the experiment (the test-set, and cluster-sets CS1, CS2, and
CS3). Recently, however, D. M. Murray has written an addition to the SMART
system which performs the necessary subsetting within SMART.

After creating the cluster-sets CS1, CS2, and CS3, Dattola's algorithm
(implemented by the SMART procedure DCLSTR) was applied to these collections.
At the same time, the Cranfield 424 documents were processed by DCLSTR.

As explained in the text, exactly 15 clusters were required in all
cases. It is a property of Dattola's algorithm that the number of clusters
produced at any time is a function of the collection used, the random seed
specified, and the number of clusters requested. It is thus not possible
to predict with accuracy how many clusters will be produced from any set
of these parameters. To obtain exactly 15 clusters of the Cranfield
documents it was necessary to use several attempts: cluster-set CS2
required 4 tries, cluster-set CS3 needed 7, while only CS1 was
successfully divided into 15 clusters in just 1 attempt. A summary of these
trial-and-error processes is given in Table C1.

Method 1 of phase 2 was implemented by keypunching the numbers
of the documents relevant to each of the queries in each cluster of CS1,
CS2, and CS3, and then sorting these three lists with another
specially-written (although trivial) program, yielding a listing of the "non-loose"
documents. The corresponding loose documents were listed by hand. It
was originally assumed that methods 2 and 3 would be done by simple
SMART searches, by using the 424 documents as queries against the three
sets of centroids and the three sets of queries. This large number of
"queries" proved unworkable in the system, and another program
modification was required.

After the definitions for all 9 cluster sets were completed, it
remained to generate 18 centroid sets, and unite these two parts of the
ultimate collections. A SMART routine called CRDCEN has as its purpose
this exact function. The experiment was delayed, however, by the
necessity to keypunch a great deal of information from the previous
tabulations. It is recommended that when a program is written to perform
the required tabulation automatically, this should be done with a view to
obtaining the data and format required by CRDCEN, the process to which the
results will eventually be passed.

Cluster Set   Attempt   Seed    No. Clusters   No. Clusters
                                Requested      Received

CS1           1         12345   15             15*

CS2           1         12345   15             12
              2         54321   15             13
              3         54321   16             14
              4         54321   17             15*

CS3           1         12345   15             14
              2         54321   15             [illegible]
              3         12345   16             14
              4         12345   17             14
              5         13345   18             16
              6         54321   18             16
              7         54321   17             15*

424 docs      1         12345   15             11
              2         12343   18             15*

*satisfactory clusters

Parameters Used in Generating Clusters
with Dattola's Algorithm

Table C1

Eventually, the entire process should be made a part of the SMART
system to be invoked like any standard clustering algorithm.



